The Wikipedia Corpus

Contributor: Liu Wei

Introduction

This dataset is a collection of the full text of Wikipedia. It contains nearly 1.9 billion words drawn from more than 4 million articles. It is a powerful NLP resource: the corpus can be searched by word, phrase, or paragraph.
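As a minimal sketch of what "searching by word or phrase" can look like over a plain-text dump, the snippet below streams a file line by line and counts case-insensitive matches. The filename passed in is hypothetical; the actual files distributed on the NYU page may be split or compressed differently.

```python
import re

def count_phrase(path, phrase):
    """Count case-insensitive occurrences of a phrase in a plain-text
    corpus file, streaming line by line so the whole dump is never
    loaded into memory."""
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            total += len(pattern.findall(line))
    return total
```

Streaming matters here: a corpus of this size will not fit comfortably in memory, so line-by-line iteration keeps the footprint constant regardless of file size.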

Size

20 MB

Volume

4,400,000 articles; nearly 1.9 billion words

Link

https://nlp.cs.nyu.edu/wikipedia-data/

Related Papers

[1] Mohamad Mehdi, Chitu Okoli, Mostafa Mesgari, Finn Årup Nielsen, Arto Lanamäki. Excavating the mother lode of human-generated text: A systematic review of research that uses the Wikipedia corpus. Information Processing and Management, 2016.
[2] Joel Nothman, Nicky Ringland, Will Radford, Tara Murphy, James R. Curran. Learning multilingual named entity recognition from Wikipedia. Artificial Intelligence, 2013, 194.